Introduction


In this notebook, we'll reinforce our understanding of the skip-gram neural network architecture by implementing it from scratch. We'll stop at just the feed-forward implementation--that is, we'll be able to evaluate the network on an input, but we won't be implementing back-propagation from scratch here.

Contents


Feed-Forward Implementation


We're going to take the weights from a pre-trained model and execute a forward pass on the network as an illustration of the architecture.

Pre-trained Model


To try something new, we'll use a different pre-trained model in this notebook. The model we'll be using comes from a nice code sample by Kavita Ganesan. She trained word2vec (using gensim) on a dataset of hotel reviews (~250k reviews, ~41.5M words) and chose 150 features (versus the 300 in the Google News model). Her model learns great word representations for adjectives like "dirty", "polite", and others that you'd expect to find in reviews.

You can find her code here if you're interested, but for this notebook I've exported the vocabulary and trained model parameters and made them available to download. Run the next cell to download them, or download them manually from the following links:

In [18]:
import os
import requests

# Create the /data/ subdirectory if needed.
if not os.path.exists('./data/'):
    print("Making directory /data/")
    os.makedirs('./data/')

# URLs and filenames for the data.    
files = [
    ("https://drive.google.com/file/d/1s-Ndz2PcHMVFOZ8AsgIb8WaEmlNnecQP", 
     "./data/projection_weights.npy"),
    
    ("https://drive.google.com/file/d/10mpIhCU6FJdGjgGynU_qmphQt2rDTtDS", 
     "./data/output_weights.npy"),
    
    ("https://drive.google.com/file/d/1goIP_NmKI3D1bprQVFNSP7zOwebt5dna", 
     "./data/index2word.p"),
    
    ("https://drive.google.com/file/d/1hV-VEscKJTJFTWm5Kkl1XRY26YvH-eMp", 
     "./data/word2index.p"),
]

print("Downloading files...")

# Download each of the files, about 85MB total.
for file in files:
    print("    " + file[1])
    
    r = requests.get(file[0], allow_redirects=True)
    
    # Write the downloaded contents to disk.
    with open(file[1], 'wb') as f:
        f.write(r.content)
    
print("Done.")
Downloading files...
    ./data/projection_weights.npy
    ./data/output_weights.npy
    ./data/index2word.p
    ./data/word2index.p
Done.

Read in the vocabulary and the weights.

In [1]:
%%time

import pickle
import numpy as np

print("\nLoading vocabulary...")

# Read in the list of vocabulary words.
vocab_list = pickle.load(open('./data/index2word.p', 'rb'))

# Read in the dictionary which maps words to their indices 
# in the weight matrices.
vocab = pickle.load(open('./data/word2index.p', 'rb'))

# Report the size of the vocabulary.
vocab_size = len(vocab_list)
print('\n    Vocabulary is {:,} words.\n'.format(vocab_size))

print("\nLoading weight matrices...\n")
# Load the weight matrices for the projection layer and the
# output layer.
W_proj = np.load('./data/projection_weights.npy')
W_out = np.load('./data/output_weights.npy')
Loading vocabulary...

    Vocabulary is 70,537 words.


Loading weight matrices...

Wall time: 4.57 s

Notation


We're going to run through the network one layer at a time, and see what comes out. We'll run it for my favorite word, "couch". :)

To help interpret the linear algebra in a neural network, I like to print the matrix dimensions at each step.

Below are a couple helper functions that I'll use to do this.

In [3]:
def shape_to_str(shape):
    '''
    Formats the dimensions of a matrix as a neat string.
    '''
    return "[{:,}  x  {:,}]".format(shape[0], shape[1])


def print_matrix_mult(x, x_name, y, y_name, z_name):
    '''
    Prints out the dimensions of a matrix multiplication,
    x * y = z
    '''
    z_shape = (x.shape[0], y.shape[1])
    
    print("%16s  *  %16s  =  %10s" % (x_name, y_name, z_name))
    
    print("%16s  *  %16s  =  %10s" % (shape_to_str(x.shape), 
                                      shape_to_str(y.shape), 
                                      shape_to_str(z_shape)))

# Let's use the first function to print out the weight dimensions.
print('')
print('    Projection layer weights: %s' % shape_to_str(W_proj.shape))
print('    Output layer weights: %s\n' % shape_to_str(W_out.shape))
    Projection layer weights: [70,537  x  150]
    Output layer weights: [70,537  x  150]


I'm using the following variable name conventions. For each of these, I append the name of the layer, "proj" or "out".

  • W - Neuron parameters ("weights")
  • z - Dot product between the output of the previous layer and the next layer's weights.
    • e.g., z_out = a_proj * W_out
  • a - Activation values for a layer.
    • e.g., a_out = softmax(z_out)

Run Layer-by-Layer


Step 1 - Input Layer

Create our input vector--a one-hot vector for the word "couch".

Side note: You may recall from the book that the one-hot vector isn't really necessary. We'll use it for now, though, and come back later to prove that it's unneeded.

In [4]:
# Initialize the one-hot as a row vector with all zeros.
one_hot = np.zeros( shape=(1, vocab_size) )

# Look up the index for "couch" and set it to 1.
one_hot[0, vocab["couch"].index] = 1

Step 2 - Projection Layer

Feed the input vector into the first layer, which if you recall, has no activation function!

In [5]:
print("\nInput --> Projection Layer")
print_matrix_mult(one_hot, "one_hot", 
                  W_proj, "W_proj", 
                  "z_proj")

# Multiply the one hot vector with the projection layer weights.
z_proj = np.dot(one_hot, W_proj)

# There is no activation function on the projection layer, so the
# output of this layer is just the dot-product from above.
a_proj = z_proj
Input --> Projection Layer
         one_hot  *            W_proj  =      z_proj
  [1  x  70,537]  *  [70,537  x  150]  =  [1  x  150]

Step 3 - Output Layer

Feed the output of the projection layer (which is actually just the word vector for "couch"!) into the output layer.

The output layer uses the softmax activation function. This function takes the exponential of a neuron's output and divides it by the sum of the exponentials of all the output neurons.

$ S \left( x_i\right) = \frac{\displaystyle e^{x_i}}{\displaystyle\sum^n_{j=1}{e^{x_j}}} $
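As an aside, the straightforward softmax implementation can overflow when the scores are large. A common trick (not used in this notebook's cells, shown here just as a sketch) is to subtract the maximum score before exponentiating, which leaves the normalized ratios unchanged:

```python
import numpy as np

def softmax(z):
    # Subtracting the max shifts every exponent by the same amount,
    # so the ratios (and the final distribution) are unchanged,
    # but np.exp can no longer overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# The naive form np.exp(scores) / np.sum(np.exp(scores)) would
# overflow to inf/nan on scores like these.
scores = np.array([1000.0, 1001.0, 1002.0])
probs = softmax(scores)
```

For the modest scores this model produces, the naive version in the next cell works fine; the stable form only matters when scores get large.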

In [6]:
print("\nProjection Layer --> Output Layer\n")
print_matrix_mult(a_proj, "a_proj", 
                  W_out.T, "W_out'", 
                  "z_out")

# Multiply the output of the projection layer with the 
# output layer weights.
z_out = np.dot(a_proj, W_out.T)

print("\nOutput Activation...\n")

# Apply the softmax function to the outputs.
a_out = np.exp(z_out) / np.sum(np.exp(z_out))

print("%16s" % "a_out")
print("%16s" % shape_to_str(a_out.shape))
Projection Layer --> Output Layer

          a_proj  *            W_out'  =       z_out
     [1  x  150]  *  [150  x  70,537]  =  [1  x  70,537]

Output Activation...

           a_out
  [1  x  70,537]

Inspecting Network Output


Now we have the distribution of context words for "couch"! Let's explore it!

Does the distribution sum to 1?

In [7]:
# Does it sum to 1.0 as it should?
print("Sum of outputs: %.2f" % np.sum(a_out))
Sum of outputs: 1.00

If I find the word "couch" in some text, what are the most likely words to find around it?

In [9]:
a_out = a_out.flatten()

# Sort the activations but return the sorted *indices*.
# This sorts them in ascending order.
indices = np.argsort(a_out)

# Reverse the order to descending with some ugly Python.
indices = indices[::-1]

print('%20s   %s' % ('--Word--', '--Output--'))

# For the most likely context words...
for i in range(0, 10):
    # Get the word index for result 'i' (in reverse order).
    word_index = indices[i]
    
    # Lookup the word.
    word = vocab_list[word_index]
    
    # Lookup the output value.
    a_out_i = a_out[word_index]
    
    # Print the word and its output.
    print('%20s   %.4G' % (word, a_out_i))
            --Word--   --Output--
             pullout   0.6606
             sleeper   0.3237
               couch   0.008325
                sofa   0.002821
             foldout   0.00166
                pull   0.001434
              chairs   0.0004315
           sectional   0.0003964
            cushions   9.422E-05
            loveseat   9.185E-05

Those are very reasonable results! Remember that this model was trained on hotel reviews, so "couch" is a very relevant word in this model. You can imagine how all of the above might appear near couch: "pullout couch", "sleeper couch", "foldout couch", "sectional couch", "couch cushions", "couch and loveseat".

You can even imagine how the word "couch" might appear near itself (the original word2vec C code doesn't appear to do anything to prevent training the input word as a context word).
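To see how that can happen, here's a hypothetical sketch of skip-gram training-pair generation (not gensim's or the C code's actual implementation): if a word appears twice within one context window, it ends up as a context word for itself.

```python
def skip_gram_pairs(tokens, window=2):
    '''
    Generate (input, context) pairs from a token list.
    Nothing here excludes the input word's own text from
    appearing as a context word.
    '''
    pairs = []
    for i, center in enumerate(tokens):
        # Pair the center word with every neighbor in the window.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# "couch" appears twice within one window, so the pair
# ("couch", "couch") is generated.
pairs = skip_gram_pairs(["pullout", "couch", "or", "couch"])
```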

Additional Topics


Is the one-hot vector necessary?


Recall from the book that the one-hot vector is really only part of the mathematical formulation, and not at all necessary in the implementation. Just to prove this to ourselves, we'll select the word vector couch, and observe that it's identical to the output of the projection layer.

In [10]:
# Look up the word vector for couch.
vec_couch = W_proj[vocab["couch"].index, :]

# Compare the word vector for "couch" with the output of 
# the projection layer using the one-hot vector. Calculate
# the distance between them to check for equality.
print("Distance between `z_proj` and `vec_couch` = %.2f" % 
          np.linalg.norm(vec_couch - z_proj.flatten()))
Distance between `z_proj` and `vec_couch` = 0.00

Why no activation function on the projection layer?


The hidden layer of this architecture is referred to as a "projection" layer because, unlike a normal neural network layer, it has no activation function.

Why not? I have to admit I don't know what would happen if you added an activation function, but I can at least demonstrate why it makes sense that one isn't needed.

Let's look at a single neuron from the projection layer to try to understand this. Remember, a single neuron in this network represents a single word vector feature, not a vocabulary word... There are 150 projection layer neurons in this network, and each neuron has 70,537 weights (one for each word in the vocabulary).

In [11]:
# Select the weights of an arbitrary neuron, number 12.
# (The brackets around "[12]" force numpy to preserve 
# it as a 2D vector)
neur_12 = W_proj[:, [12]]

# Print the dimensions of this step.
print("Feed forward neuron 12...")
print_matrix_mult(one_hot, "one_hot", 
                  neur_12, "neur_12", "z1_12")

# Feed forward through this neuron.
z_proj_12 = np.dot(one_hot, neur_12)

# What's the output?
print("")
print("  Neuron 12 output: %f" % z_proj_12)
print("'Couch' feature 12: %f" % 
          W_proj[vocab["couch"].index, 12])
Feed forward neuron 12...
         one_hot  *           neur_12  =       z1_12
  [1  x  70,537]  *    [70,537  x  1]  =   [1  x  1]

  Neuron 12 output: 1.440549
'Couch' feature 12: 1.440549

Normally a hidden layer neuron takes a linear combination of the input vector and the neuron's weights. We then have to introduce a non-linearity (such as sigmoid or ReLU) to give the network non-linear properties.

Here, however, we're doing nothing more than selecting a value from the neuron's weights, unmodified. This "projection" layer doesn't actually do any computation, so the addition of an activation function seems pointless.

Note that, for the same reason, the hidden layer doesn't include a bias term, either.
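Putting the pieces together, the whole forward pass reduces to a row lookup followed by one matrix multiply and a softmax. Here's a minimal sketch of that idea (using tiny random stand-in matrices rather than the model loaded above, just so it runs on its own):

```python
import numpy as np

def forward(word_index, W_proj, W_out):
    # The "projection layer" is just a row lookup -- no one-hot
    # multiply, no activation, no bias.
    word_vec = W_proj[word_index]
    # The output layer scores the word vector against every
    # vocabulary word, then softmax turns scores into a distribution.
    scores = word_vec @ W_out.T
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)

# Tiny stand-in weights (5 "words", 3 features) for illustration.
rng = np.random.default_rng(0)
W_proj = rng.normal(size=(5, 3))
W_out = rng.normal(size=(5, 3))
dist = forward(2, W_proj, W_out)
```

With the real matrices, `forward(vocab["couch"].index, W_proj, W_out)` would reproduce the `a_out` distribution computed earlier.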

You could perhaps think of the word2vec architecture as a single softmax layer, with the word vectors being the inputs to the network. In this case, however, we backpropagate to the training samples, and modify the inputs as part of the training!

(It would be an interesting exercise, I think, to apply this to something like MNIST image classification--how would the images change if you backpropagated to the image vectors themselves?)